Integrated Cancer Subtyping using Heterogeneous Genome-Scale Molecular Datasets

Vast repositories of heterogeneous data from existing sources present unique opportunities. Taken individually, each of the datasets offers solutions to important domain and source-specific questions. Collectively, they represent complementary views of related data entities with an aggregate information value often well exceeding the sum of its parts. Integration of heterogeneous data is therefore paramount to i) obtain a more unified picture and comprehensive view of the relations, ii) achieve more robust results, iii) improve the accuracy and integrity, and iv) illuminate the complex interactions among data features. In this paper, we have proposed a data integration methodology to identify subtypes of cancer using multiple data types (mRNA, methylation, microRNA and somatic variants) and different data scales that come from different platforms (microarray, sequencing, etc.). The Cancer Genome Atlas (TCGA) dataset is used to build the data integration and cancer subtyping framework. The proposed data integration and disease subtyping approach accurately identifies novel subgroups of patients with significantly different survival profiles. With current availability of vast genomics, and variant data for cancer, the proposed data integration system will better differentiate cancer and patient subtypes for risk and outcome prediction and targeted treatment planning without additional cost and precious lost time.


Introduction
Genomic and epidemiologic studies over the past decade have generated a wealth of data, including molecular, variant, and clinical data on both individuals and populations that can be leveraged to better understand cancer risk, progression, and outcomes. Subtyping patient disease populations using high-dimensional molecular data has transformed how researchers and clinicians interpret and quantify heterogeneity within a disease. Subtyping has been highly effective in discovering cancer types, tumor histologies, survival rates, treatment planning and responses. Investigation of clinically relevant disease subtypes cannot be achieved by using a single dataset in isolation from others due to the heterogeneity of cancer with multifactorial etiology. Hence, careful integration of diverse data is crucial (e.g. molecular, clinical, environmental data) [1].
The heterogeneity of diseases such as breast cancer is well recognized and gene expression profiling has been used to identify at least four major subtypes: luminal A, luminal B, HER2+ and basal-like [2]. In the past decade, important clinical advances in cancer treatments are attributed to molecularly targeted treatments aiming at specific genes such as estrogen receptor alpha (ER-α), the human epidermal growth factor receptor 2 (HER2), the epidermal growth factor receptor (EGFR), etc [3]. Targeted treatments result in greater efficacy and fewer debilitating or dose limiting side effects. This clearly proves that it is important to identify and appropriately treat each individual disease subtype. However, our current understanding of disease subtypes appears to be very limited. Despite targeted treatment advances, targeted therapies often fail for some patients. For breast cancer, while 20% of tumors overexpress the HER2 oncogene, one-third of these fail to show response to HER2-targeted therapies right from the outset. Research and clinical studies present a similar story for anti-estrogen treatment of ER-α-positive breast cancer, and androgen ablation of androgen receptor positive prostate cancer [4]. Not all patients show an initial response, and from those who do, a significant number will develop resistance. The fact that a substantial fraction of patients with a given subtype of cancer respond very differently to the same treatment, either immediately or later on, means that either: i) the known subtypes are not truly homogeneous and must be further refined, or ii) that subgroups of patients may have different mechanisms of defense against the same tumor type.
Consensus Clustering (CC) [16] is a state-of-the-art approach desired to find a single clustering by reconciling clustering information from various sources or from different runs of the same algorithm. However, CC cannot be used to combine multiple data types with different scales and most of the time the analysis of each data type leads to different results (subgroups) that are hard to interpret. Other machine learning approaches such as iCluster [17], and iClusterPlus [18] addresses the challenge of integration by using statistical models that can simultaneously perform clustering, data integration, feature selection and dimensionality reduction using a probabilistic matrix factorization approach. Though powerful, they are limited by their strong assumptions about the data as well as by the gene selection step necessary to reduce computational complexity. Similarity Network Fusion (SNF) [19] is another state-of-the-art approach that can be used for cancer subtyping by integrating multiple data types. Herein, samples are constructed into separate networks for each data type and fused into one network that represents the full spectrum of the data. However, the unstable nature of kernel-based clustering makes the algorithm sensitive to small changes in molecular measurements or in its parameter settings. Cancer Integration via Multikernel Learning (CIMLR) is another kernel-based approach that adds weights to different data types [30].MaxSilhoutte is a clustering technique based on cluster tightness and separation where each cluster is represented by a so-called silhouette [27]. MaxSilhoutte, however, is not designed to integrate multiple data types, and hence requires the separate datasets to be concatenated for integrative analysis.
Nguyen et. al.'s recent paper [20] has inspired and given us the basis on which we have built a data integration and disease-subtyping framework. They have proposed a novel integrative approach, called Perturbation clustering for data INtegration and disease Subtyping (PINS) that addresses subtype discovery using a single datatype or integration of multiple data types. The method determines the optimal number of clusters and then partitions the samples in a way such that the results are robust to noise and data perturbation. The study integrated multiple quantitative numerical data types (mRNA, methylation, microRNA) that came from different platforms, different scales and different cellular phenomena. Though powerful, the approach proposed here can only be applied on quantitative numerical data types. In this paper, we have proposed a new method that can integrate both qualitative and quantitative numerical data to better identify the cancer subtypes and novel subgroups of patients with significantly different survival profiles.
Cancer being a heterogeneous disease with large genetic diversity even between tumors of the same cancer types, it is common for the patients to have significant differences between their molecular profiles. Hence, majority of the recent studies use integrative approaches that combines multiple types of molecular data such as Methylation, mRNA expression, DNA copy number variation etc. accounting for variations among individuals and thereby achieving more accurate subtyping [21,28,29]. However, because of the noise level of these datasets and the complexity of the disease, the results are not producing significant separation between the subgroups [22]. Therefore, recent studies have proposed to use additional datasets such as somatic variants [21,23], and clinical data [30] in combination with the aforementioned molecular data types as a new source of information. Gligorijevic et. al. has shown that careful integration of different data types can address several challenges as i) stratification of patients with different clinical outcomes, ii) prediction of driver genes, iii) repurposing of drugs treating particular cancer patient groups [21].
In a previous study, we have proposed a cancer subtyping methodology using solely somatic variant data available at TCGA [24]. We were not interested in any clusters that form or disappear due to small changes in the data, but rather for those groupings that remain stable across many clusterings built in the presence of small changes. To identify such clusters, we have generated new datasets by perturbing the original data using a Post Randomization (PRAM) method and reconstructing the clustering. The discrepancy between the original and the perturbed data was used to assess the stability of the clusters. The results have shown that the proposed approach can identify disease subtypes better than the state-of-the-art approaches.
In this paper, we integrated the subtyping approach we proposed in [24] for somatic variants with the subtyping approach defined by two of our authors [20] for mRNA, miRNA and Methylation. We developed a data integration system for cancer subtyping that flexibly integrates both qualitative and quantitative datatypes using existing datasets available at TCGA. We believe that integrating multi-variate heterogeneous datatypes will improve the consistency and actionable information value of the consensus subtypes. Developed framework will be a valuable precision medicine resource for the wider scientific community on other diseases to pursue a multitude of studies that have not been possible due to limitations of existing integrative subtyping methods.

Methods
We analyzed five different cancers available at The Cancer Genome Atlas (TCGA) website (https://tcga-data.nci.nih.gov/): Kidney Renal Clear Cell Carcinoma (KIRC), glioblastoma multiforme (GBM), acute myeloid leukemia (LAML), breast invasive carcinoma (BRCA), and colon adenocarcinoma (COAD). Table 1 shows the basic descriptions of the five datasets we have analyzed. We used mRNA expression, DNA Methylation, miRNA expression and somatic variant data to identify the subtypes for each of the five cancers. Subtyping is first performed on each data in isolation and the obtained results are then integrated to improve the differentiation between subgroups.

Subtyping Qualitative Data
Herein, we have used the somatic variant data to identify the cancer subtypes. The somatic variant data is stored in a binary matrix, where "1" denotes a mutation on the host gene, and "0" denotes the absence of mutation, with the rows and columns corresponding to the samples and genes, respectively. Somatic variants can be defined as an alteration in DNA identified by comparing a normal sample with a tumor sample and generally very sparse since the proportion of variants are minor compared to the whole genome size.
For each of the five datasets, we calculated the pairwise distance between all patients using the Jaccard index. For each patient, the somatic variant profile is represented as a binary vector and the Jaccard index is computed as 11 represents the total number of mutated genes for patients A and B, M 01 represents the total number of genes where patient A has a value of '0' and B has a value of '1', and M 10 represents the total number of genes where patient A has a value of '1' and B has a value of '0'. The Jaccard indexis computed for each pair of patient in the dataset, resulting in a similarity matrix that can be used as an input to any distance based clustering method. To identify the subtypes, we exploit the agglomerative hierarchical clustering using the Ward's method as the linkage criteria as well as the Partition Around Medoids (PAM) clustering. The identified subtypes can be illustrated in a matrix form referred to as the connectivity matrix.

Construction of Data Connectivity-
The input is a dataset, E ∈ R NXM where N is the number of subjects and M is the number of features for each subject. For somatic variants, the pairwise similarity between each pair of subject is computed using the Jaccard similarity measure and stored in a matrix form. We then partition the subjects into k clusters for each value of k ∈ [2..K] using a clustering algorithm. We have used the Agglomerative Hierarchical Clustering and PAM but a number of other classical distance based clustering approaches could be used instead. The input dataset E can be presented as a set of vectors E = e 1 , e 2 , …, e N , where each vector e i represents the features of the i th subject. A partition P k = P 1 , P 2 , …, P k represents a set of subjects that are members of the same cluster. We generate a pairwise connectivity matrix C k ∈ 0, 1 NXN , which can be defined as follows: .K]: e i , e j ∈ P t 0 otherwise (2) Here, the connectivity between two subjects is '1' if and only if they belong to the same cluster and '0' otherwise.

Generating
Perturbed Datasets-One challenge of clustering is the determination of the number of clusters, i.e. the number of subtypes. The proposed approach hypothesizes that the number clusters should be robust with respect to the systemic noise of the features within the population. Hence, we have utilized a perturbation mechanism to add noise to the input data many times and construct connectivity matrices for each perturbed dataset. The original and the average perturbed connectivity matrices are then compared to assess the stability of pair-wise connectivity (identical or different cluster membership) for each pair of subjects. Number of clusters, providing the highest degree of stability with a certain amount of perturbation, is considered to be optimal.
Accordingly, we first developed a perturbation method for discrete and binomial data by employing a post-randomization (PRAM) methodology [25]. PRAM is a perturbative method for disclosure protection of qualitative variables [25]. Applying PRAM on a dataset leads the values of a number of variables to be changed according to a specified probability mechanism. PRAM is commonly used to protect sensitive data files against disclosure by randomization of individual record data with the proper choice of transition probabilities.
As a first step in perturbing data, let ε denote a qualitative variable in the original dataset with K categories, numbered 1, .., K to which PRAM is applied and ε denote the same categorical variable in the perturbed data file. Let P = p kl be a KxK Markov probability matrix defined as; P(ε = l | ε = k), that denotes the probability of the original value ε = k transitioned into a value of ε = l. For a dataset of n records, let ε (r) denote the value of ε for the rth record in the data file.
Given that ε (r) = k, applying PRAM means that the value of ε (r) is drawn from the probability distribution p k1 , …, p kK This procedure can then be applied independently on each record in the datafile. Consider an example where the variable ε represents the somatic variants with "1" denoting a mutation on the host gene, and "0" denoting the absence mutation (hence, the number of categories, K = 2). The Markov probability matrix P can then be defined as follows: In this paper, we have randomized the variants for each subject using equal probabilities for the transitions. A variant will switch from present to absent and vice versa, with the probability of θ, and stay as is with a probability 1 -θ. If the transition probabilities are set too low, the added noise will not perturb the data sufficiently. If the probabilities are too high, the perturbation may significantly change the patterns of the data, causing the subtypes to be indifferentiable due to the added noise. Therefore, the selection of the transition probabilities has an important effect on identifying the hidden subtypes of the data. To determine the transition probabilities, we considered the mutation in each gene as an independent Bernoulli trial. The Bernoulli process applies to discrete stochastic sequences and each component (1,0) designates whether a mutation happened at a specific position. This way, we can use the variance of each Bernoulli trial to determine the transition probabilities of PRAM.
In equation 4, the variance, σ 2 , would correspond to 1 -θ, e.g. if the median variance of a dataset is calculated as 0.02, then the transition probability, θ, is set to 0.98, which means that there would be a 98% probability for any somatic variant (0,1) at E(i, j)to remain the same. This process allows us to construct numerous perturbed versions of the original data.

Construction of Perturbed
Connectivity-To construct the connectivity matrices for each perturbed data, we clustered each perturbed dataset using both hierarchical clustering and PAM with varying values of k ∈ [2..1]. Since true cluster assignments is assumed to be robust with respect to small perturbations, the ideal case would be the individual patient's cluster assignments to remain the same on both original and perturbed datasets for the optimal cluster size, k. Since we have generated many perturbed versions of the original data (say L perturbation datasets) for each cluster k, the overall connectivity matrix, A k can be calculated by averaging the connectivity matrices of each perturbed dataset, G k G k l (i, j) = 1 i f i and j belong to the same cluster 0 otherwise (5) A k = 1 l ∑ l = 1 l G k l (6) Hence, the discrepancy between the original connectivity matrix C k ∈ 0, 1 NXN and the average connectivity matrix of the perturbed data A k can be calculated to measure the stability of each cluster size k. The cluster associated with the minimal discrepancy is then identified as the optimal cluster size.

Subtyping Quantitative Data
Herein, we have used mRNA expression, DNA methylation and microRNA data to identify the cancer subtypes. We have used the method introduced by Nguyen et. al. [20]. Each of these are quantitative numerical datatypes with different scales. Each datatype is first used in isolation to identify the subtypes and the results are then integrated to determine the consensus subtypes. The perturbation methodology used for quantitative data is different from the method used for qualitative data.

Construction of Data Connectivity-The input is a dataset,E ∈ R NXM
where N is the number of subjects and M is the number of features for each subject. We partition the subjects into k clusters for each value of k ∈ [2..K] using the traditional k-means clustering. The connectivity matrices for the quantitative data are then constructed the same way as in qualitative data (See Section 2.1).

Generating Perturbed
Datasets-For the quantitative data, the perturbation is performed by adding Gaussian noise to the original data. We perturb the data with a noise level that has a variance equal to the variance of the data in order to prevent the perturbation from significantly changing the patterns of the data and causing the subtypes to be indifferentiable due to the added noise. The variance is calculated as follows: We Each perturbed data L h is then re-clustered for each cluster size. The perturbed connectivity matrices for quantitative and qualitative data are constructed using the same approach as discussed in Section 2.1.2.

Integration of Connectivity Matrices
Once the connectivity matrices for the optimal cluster size k are generated for each datatype, we then integrate those matrices by the method described below. In the ideal case, different data types should give consistent connectivity between subjects. However, in practice, different data types can give contradictory information. Therefore, we need to rely on the average connectivity between data types in order to partition the samples. The average pairwise connectivity between samples can be calculated as follows:S c = ∑ i = 1 T represents the different datatypes within the dataset. Hence, S c (i, j) will be 0 if i and j are never clustered together , 1 if i and j are always clustered together, and between 0 and 1, if i and j are clustered together in some datatypes.
We refer to S c as the similarity matrix and (1 -S c ) as the distance matrix. The matrix of pairwise distances (1 -S c ) can be directly used by a similarity-based clustering algorithm such as agglomerative hierarchical clustering, PAM or dynamic tree cut to partition the dataset. The framework of the proposed data integration and disease subtyping methodology is illustrated in Figure 1.

Further Splitting Discovered
Groups-At this stage, we attempt to sub-split the discovered subgroups to better identify the clusters. Given that the subgroup identification proposed here is an unsupervised approach, prior information such as patient demographics that may be predominant are missing. The presence of a subgroup can therefore be obscured. In addition, there may be distinct subgroups that share clinically relevant characteristics. For instance the already identified subgroups of breast cancer, Luminal A and Luminal B, are both estrogen receptor positive, which may require the two groups to be further examined to identify the heterogeneity between them. First, we check the agreement between the constructed connectivity matrices of each data type. An entry will be '0' if the pair of subjects, i and j are never clustered together and '1' if they are always clustered together. If the pair is clustered together only within the connectivity matrices of certain datatypes, we consider no agreement between the two subjects. If there is an agreement that exceed the set threshold (e.g., >50%), we consider further splitting the subgroups into clusters.
In order to sub-split the identified subgroups we have used the gap statistics. Gap statistic is a method used to estimate the most possible number of clusters in a partition clustering. We have used the criterion introduced by Tibshirani et. al. [26] that uses the output of any clustering algorithm by comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Suppose D r be the sum of the pairwise distances for all points in cluster r and n be the sample size, then; Arslanturk et al. Page 8 W k can be defined as the within-cluster sum of squares around the cluster means. The gap statistic can then be computed as Gap n (k) = E n * log W k − log W k . (11) where E n * denotes the expectation under a sample of size n from the reference distribution.
We have applied the gap statistic only on subgroups that have at least a certain number of subjects (e.g., 30). The subgroup(s) are split into k clusters with varying values of k ∈ [1…K/2]. Note that, if the optimal number of clusters using the Tibshirani criterion is 1, no further splitting would be required. If otherwise, the subgroup would be further split into k clusters. One limitation of further splitting the subgroups is the potential of overfitting. As the within-cluster similarity increases when forming new and finer clusters, it may also lead to fitting the noise. In order to prevent the overfitting, we introduced a regularization term that restricts high number of clusters by reducing the gap ratio each time a new cluster is introduced.

Results
The results of the proposed method using five different datasets is reported in Table 2, where k denotes the optimal cluster size and Cox P denotes the statistical significance between identified subtypes estimated based on the predictive accuracy on the survival time. The subtypes are analyzed using the Kaplan-Meier analysis and their statistical significance is assessed using Cox regression. The integrated results clearly show a better differentiation than the individual data types.
Our results are compared with PINS, CC, SNF, iCluster+ and maxSilhoutte methods. The results have shown that the proposed integration significantly differentiates the identified subtypes for all investigated diseases and outperforms the integrated results of the aforementioned state-of-the-art techniques. Figure 2 (left) shows the Kaplan-Meier survival curves of the proposed methodology using the acute myeloid leukemia (LAML) dataset compared with the survival curves obtained through integration of mRNA, methylation and miRNA data using PINS (center) and CC (right). The proposed integrative clustering with somatic variants, methylation, mRNA and miRNA discovers two patient groups with significantly different survival profiles (p-value = 10 −3 ). In contrast, the integrative clustering without somatic variants discovers four different patient groups with less significant survival profiles (p-value = 2.4×10 −3 ). These survival curves clearly show that incorporating qualitative data (i.e. somatic variants) into the integration process outperforms the subtyping performance. We observed similar performances on the other datasets.

Further Analysis of Discovered Subtypes
Herein, we looked into significant survival differences to identify cancer subtype specific biomarkers. Specifically, we investigated mutations that are abundant in patients within the short-term survival group but not within the long-term survival group and vice versa. For LAML, mutated genes that are abundant in patients within the short-term survival group are TP53, DNMT3A and FLT3 while NPM1 is found to be enriched in long-term survival groups. VHL is mutated in all subgroups of KIRC except one in which all patients survive at the end of the study. Similarly, PBRM1 is found to have high mutation rates on patients with short-term survivals and low mutation rates on patients with long-term survivals. Our results show that GBM subtypes are highly influenced by methylation profiles (See Table 2). The genes identified as biomarkers in GBM are TTN, TP53, PTEN and EGFR. The mutation rates of those genes are significantly higher in patients that have short-term survival rates. IDH1 and ATRX are highly enriched in patients with long-term survival. Parsons et. al. have indeed shown that patients with IDH1 mutation usually have a significantly longer survival [27] and IDH1 can be used as target for therapy and drug development [28]. High mutations in BRAF and p53 were determined in patients with short-term survival of COAD. Contrary to recent studies, no significant association was found between KRAS and short-term survival [29]. We have further compared the BRCA subtypes with known targets. Out of 172 patients, there are 34 ER-negative (ER-), 134 ER-positive (ER+) and 4 not evaluated. 27 out of 34 ER-patients are found to have a short-term survival, whereas the ER+ patients are uniformly distributed across the four clusters identified.

Summary and Conclusion
In this paper, we have identified cancer subtypes using somatic variant, mRNA, methylation, and miRNA data types. This method can be applied on any quantitative or qualitative dataset for the purpose of disease categorization, patient subgroup detection, response to treatment identification, drug development and repurposing, or biomarker detection. This method can be applied on any dataset for the purpose of disease categorization, patient subgroup detection, response to treatment identification, drug development and repurposing, or biomarker detection. The method scales well to high dimensional data. However, the time complexity is higher as compared to classical approaches due to repeated perturbations. This can be resolved by performing the computations in parallel. Another limitation of the proposed method is that all data types are treated equally in determining subtypes, which may not always be appropriate. For instance, studies have shown that methylation plays a major role in determining the GBM subtypes. One way to address this limitation is to combine the connectivity matrices in a weighted manner. Future work includes: i) incorporating different mutation types (silent, missense, nonsense, etc.), classifications (SNP, insertion, deletion, etc.) and counts into the proposed disease subtyping method ii) incorporating clinical data into the integration process to examine the significance of different survival profiles and iii) utilizing the identified biomarkers to measure pathway deregulations, which would justify the application of certain therapies and customize treatment plans for individuals. Kaplan-Meier survival curves of integrative genomic data clustering using proposed approach (left), PINS (center) and CC (right).